Classifying test data based on a maximum margin classifier

ABSTRACT

Systems and methods for classifying binary data are described. Training data having a predefined sample size is obtained, the training data being composed of separable binary datasets. An exact bound on the Vapnik-Chervonenkis (VC) dimension of a classifier for the training data is determined, wherein the exact bound is based on one or more variables defining a hyperplane. The exact bound may be minimized to generate a classifier for predicting at least one class to which a given data sample of the training data belongs.

BACKGROUND

Learning machines utilize a variety of training approaches for analyzing data and recognizing patterns. As part of such approaches, the learning machines are trained to generalize using data with known outcomes. Once such learning machines are trained, they may be subsequently used for classification of actual data in cases where the outcome is unknown. For example, a learning machine may be trained to recognize patterns in data. Learning machines may be trained to solve a wide variety of problems across a variety of disciplines. An example of such a learning machine is a support vector machine (SVM). It should be noted that the data to be analyzed may correspond to a variety of technical fields, such as biotechnology and image processing.

BRIEF DESCRIPTION OF THE DRAWINGS

The following detailed description references the drawings, wherein:

FIG. 1 is a block diagram of a data classification system, as per an implementation of the present subject matter; and

FIG. 2 is a flowchart of a method for classifying data, as per an example of the present subject matter.

SUMMARY

This summary is provided to introduce concepts related to systems and methods for classifying data. The concepts are further described below in the detailed description. This summary is not intended to identify essential features of the claimed subject matter nor is it intended for use in determining or limiting the scope of the claimed subject matter.

Systems and methods for classifying data and enabling learning of machines are described. In one implementation, training data having a predefined sample size is obtained. In the present implementation, the training data is composed of separable datasets. Subsequently, a Vapnik-Chervonenkis (VC) dimension for the training data is determined. Based on the VC dimension, an exact bound on the VC dimension is further determined. On obtaining the exact bound on the VC dimension, the exact bound is minimized. Based on the minimization of the exact bound, a classifier is obtained. The generated classifier may be used for predicting at least one class to which samples of the training data belong.

These and other aspects of the present subject matter are further described in conjunction with the detailed description, as provided in the sections below.

DETAILED DESCRIPTION

Recent developments in technology have seen an increase in the usage of computing devices. Such computing devices may be used in a variety of technological fields, such as image processing, searching, biotechnology (gene classification), and others. In such cases, the computing devices may perform a variety of operations based on volumes of data. Processing of data is typically implemented using computing programs and predefined rules or conditions, which are rigid.

However, for certain objectives, such functionalities may not be efficiently carried out using programming alone. Example applications include spam filtering, optical character recognition (OCR), and search engines, to name a few. In such cases, computing devices may follow approaches which rely on data processing models that are based on presently available data. The available data includes input data and known outcomes corresponding to the input data. Based on the available data, various predictions or decisions may be made, rather than carrying out such decisions based on rigid programmed instructions.

An example of such computing devices includes support vector machines (SVMs). Prior to determining which category or class a given occurrence may correspond to, a stage of learning is implemented. During the learning stage, given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new occurrences to one category or the other. As part of the learning stage, a classifier may be obtained. The classifier may be considered as a logical separation which separates two or more classes or groups to which the training examples may relate. Generally, the classifier may be determined based on the characteristics of the training examples themselves. The type of the classifier may in turn be dependent on the type of the training examples. If the training examples are linearly separable, the classifier may be a linear classifier. An example of a linear classifier may include a straight line or a plane if the training examples may be represented in a Euclidean space. In case the instances of the training examples are non-linearly separable, the resulting classifier may be a non-linear function.

The determination of such classifiers may be carried out through computing devices. Such computing devices may, based on characteristics of the training examples, determine the appropriate classifier. In operation, such computing devices may tend to obtain the classifier by generalization of the training examples to obtain a model based on which subsequent decisions on given data may be carried out. An example of a measure of such generalization is the Vapnik-Chervonenkis dimension (VC dimension), which measures the capacity of a classification approach. As is understood, the capacity of any classification approach also provides an indication of its complexity. For example, any classifier which is characterized by a high VC dimension is complex, and a classifier characterized by a low VC dimension is considered less complex. It is therefore desired that any generalizations which are carried out have a low VC dimension. As should be noted, any computing device based on a low VC dimension would tend to generalize better when compared with systems having high VC dimensions, as such systems would tend to overfit while obtaining a classifier. It is for this reason that a classifier characterized by a low VC dimension is desired.

While obtaining a classifier with a low VC dimension, the entire training data may have to be analyzed to obtain the classifier. In such a case, all characteristics or features of the training data may be utilized for obtaining the classifier. However, this approach may typically involve considering all characteristics of the training data. This may in turn require considerable processing resources and may not provide an accurate classifier which most suitably distinguishes between the different classes to which the training data may correspond.

In the case of a non-linear classifier, the VC dimension is related to the number of support vectors used by the classifier. The number of computations required to be performed when testing a test sample, whose outcome or result is not known, is proportional to the number of support vectors. The support vectors are typically a subset of the training set. The storage or memory cost of the trained learning machine is also proportional to the number of support vectors. The number of support vectors thus has an impact on the run time of an application using such a learning machine. On a portable or embedded device such as a smart-phone, the speed of processing, the energy consumption, and consequently the battery life, depend considerably on the number of computations and data accesses. Furthermore, the manner in which the classifiers for SVMs are obtained depends on solving quadratic functions. Solving such functions requires considerable processing and storage resources. Implementing such mechanisms for hand-held computing devices may therefore not be efficient.

To this end, approaches for classifying data are described. In one implementation, the classification of the data is based on a maximum margin classifier having a low VC dimension. The low VC dimension classifier is obtained based on functions which form the exact bound on the VC dimension. Once the exact bound on the VC dimension is obtained, it is minimized to obtain the classifier. As would be explained in the following sections, the classifier thus obtained is of low VC dimension. Furthermore, the classifier is obtained by considering only essential, non-redundant characteristics of the training data. In such a case, the processing required is less and the process of obtaining the classifier is efficient. Furthermore, since the basis on which the classification is performed may also involve fewer features, the process of classification is faster and more efficient. Various experimental results are also shared in the following description, indicating the increased efficiency with which the classification is carried out.

Aspects of the present subject matter meet the above-identified unmet needs of the art, as well as others, by providing computing systems for recognizing patterns and significant discriminative features in data, such as images and bio-informatics databases, building classifiers using such data, and providing predictions on other data whose result or outcome is not known. In particular, aspects of the present subject matter implement computing devices for the recognition of images such as handwritten or printed characters, text, or symbols. These may also be used for analyzing biological and medical information, such as the gene expression data provided by microarrays.

The above mentioned implementations are further described herein with reference to the accompanying figures. It should be noted that the description and figures relate to exemplary implementations, and should not be construed as a limitation to the present subject matter. It is also to be understood that various arrangements may be devised that, although not explicitly described or shown herein, embody the principles of the present subject matter. Moreover, all statements herein reciting principles, aspects, and embodiments of the present subject matter, as well as specific examples, are intended to encompass equivalents thereof.

FIG. 1 depicts an exemplary data classification system 100 implemented as a computing device for classifying data. The data classification system 100 may be implemented as a stand-alone computing device. Examples of such computing devices include laptops, desktops, tablets, hand-held computing devices such as smart-phones, or any other forms of computing devices. Continuing with the present implementation, the data classification system 100 may further include a processor(s) 102, interface(s) 104 and memory 106. The processor(s) 102 may also be implemented as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions.

The interface(s) 104 may include a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, network devices, and the like, for communicatively associating the data classification system 100 with one or more other peripheral devices. The peripheral devices may be input or output devices communicatively coupled with the data classification system 100. The interface(s) 104 may also be used for facilitating communication between the data classification system 100 and various other computing devices connected in a network environment. The memory 106 may store one or more computer-readable instructions, which may be fetched and executed for classifying data. The memory 106 may include any non-transitory computer-readable medium including, for example, volatile memory, such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.

The data classification system 100 may further include module(s) 108 and data 110. The module(s) 108 may be implemented as a combination of hardware and programming (e.g., programmable instructions) to implement one or more functionalities of the module(s) 108. In one example, the module(s) 108 includes a data classification module 112 and other module(s) 114. The data 110, on the other hand, includes training data 116, classifier 118, and other data 120.

In examples described herein, such combinations of hardware and programming may be implemented in a number of different ways. For example, the programming for the module(s) 108 may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the module(s) 108 may include a processing resource (e.g., one or more processors) to execute such instructions. In the present examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement module(s) 108 or their associated functionalities. In such examples, the data classification system 100 may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the data classification system 100 and the processing resource. In other examples, module(s) 108 may be implemented by electronic circuitry.

In operation, the data classification system 100 may receive a set of training data. In one implementation, the training data may include examples which correspond to two or more distinct classes. The training data may further be linearly separable or non-linearly separable. In one example, the training data may be obtained from training data 116. Once the training data 116 is obtained, the data classification module 112 may further determine a Vapnik-Chervonenkis (VC) dimension corresponding to the training data 116. Once the VC dimension is obtained, the data classification module 112 may further determine an exact bound for the VC dimension. The exact bound may be considered as upper and lower limits for the VC dimension. Subsequently, the data classification module 112 may minimize the exact bound on the VC dimension to obtain the classifier. In one example, the exact bound may be a function of the distance of the closest point, from amongst the training data, from a notional hyperplane. The notional hyperplane may be such that it classifies a plurality of points within the training data with zero error. In one implementation, the notional hyperplane may be expressed using the following expression:

$$u^{T}x + v = 0$$

-   wherein u^T is a row vector containing the same n elements as the column vector u, in that order; and
-   v is a scalar denoting the offset or bias associated with the hyperplane u^T x + v = 0.

The operation of the data classification system 100 is further explained in conjunction with the following relations. It should be noted that the following relations are only exemplary and should not be construed as a limitation. Other relations expressing the same or similar functionality would also be within the scope of the present subject matter.

In the present implementation, the training data 116 may include binary classification datasets for which a classifier is to be determined. The training data 116 may include data points x^i, i=1, 2, . . . , M, where samples of class +1 and −1 are associated with labels y^i=1 and y^i=−1, respectively. For the present training data 116, the dimension of the input samples is assumed to be n.
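As a purely illustrative sketch (the array names X and y are hypothetical and not part of the description), such a training set may be represented as follows; the later code sketches assume the same layout:

```python
import numpy as np

# Hypothetical binary training set: M = 4 samples, n = 2 features.
# Each row of X is a data point x^i; y holds the class labels y^i in {+1, -1}.
X = np.array([[0.5, 1.2],
              [1.1, 0.9],
              [-0.7, -1.3],
              [-1.0, -0.4]])
y = np.array([1, 1, -1, -1])

M, n = X.shape  # sample size M and input dimension n, as used in the text
```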

It should be noted that, for the set of all gap-tolerant hyperplane classifiers with margin d > d_min, the VC dimension γ is bounded by the following function:

$$\gamma \leq 1 + \min\left(\frac{R^{2}}{d_{\min}^{2}},\; n\right) \tag{1}$$

where R is the radius of the smallest sphere enclosing all the training samples.

Equation (1) suggests that minimizing the machine complexity requires maximizing the margin as well as minimizing R². Since the square function increases monotonically, and since both R and d_min are positive quantities, in one implementation the data classification system 100 minimizes R/d_min. This is particularly relevant when the dimension n is large, since the term R²/d_min² then determines the bound in Equation (1).

As mentioned previously, the training data 116 may be linearly separable or non-linearly separable. For the implementation where the training data 116 is linearly separable, a notional hyperplane may exist which can classify these points with zero error, and which can be represented by the following relation:

$$u^{T}x + v = 0$$

With the above relation, the margin may be considered as the distance of the closest point within the training data 116 from the hyperplane, and is given by:

$$\min_{i=1,2,\ldots,M}\; \frac{\left\|u^{T}x^{i} + v\right\|}{\|u\|} \tag{2}$$

From the above, the following relation may also be derived:

$$\frac{R}{d} = \frac{\max_{i=1,2,\ldots,M}\left\|x^{i}\right\|}{\min_{i=1,2,\ldots,M}\dfrac{\left\|u^{T}x^{i} + v\right\|}{\|u\|}} \tag{3}$$

which may also be represented as:

$$\frac{R}{d} = \frac{\|u\|\,\max_{i=1,2,\ldots,M}\left\|x^{i}\right\|}{\min_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|} \tag{4}$$

Since in the present implementation gap-tolerant classifiers with a margin d ≥ d_min are considered, we have

$$\min_{i=1,2,\ldots,M}\; \frac{\left\|u^{T}x^{i} + v\right\|}{\|u\|} \geq d_{\min} \tag{5}$$

$$\Longrightarrow\; \|u\| \leq \frac{\min_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|}{d_{\min}} \leq \frac{\max_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|}{d_{\min}} \tag{6}$$

This gives:

$\begin{matrix}{{{u}\underset{{i = 1},2,\ldots\mspace{14mu},M}{Max}{x^{i}}} \leq {\underset{{i = 1},2,\ldots\mspace{14mu},M}{Max}{{{{u^{T}x^{i}} + v}} \cdot \frac{{Max}_{{i = 1},2,\ldots\mspace{14mu},M}{x^{i}}}{d_{\min}}}}} & (8) \\{{{{u}\underset{{i = 1},2,\ldots\mspace{14mu},M}{Max}{x^{i}}} \leq {\beta\underset{{i = 1},2,\ldots\mspace{14mu},M}{Max}{{{u^{T}x^{i}} + v}}}},} & (9)\end{matrix}$where β is a constant independent of u and v, and dependent only on thedataset and the choice of d_(min).

In order to determine the classifier, the data classification system 100 is to obtain a solution to the following relation:

$$\min_{u,v}\; \frac{\max_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|}{\min_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|} \tag{10}$$

Since the training data 116 is linearly separable, it may also be represented as:

$$u^{T}x^{i} + v \geq 0,\quad \text{if } y_{i} = 1 \;(\text{Class } 1 \text{ points}) \tag{11}$$

$$u^{T}x^{i} + v \leq 0,\quad \text{if } y_{i} = -1 \;(\text{Class } {-1} \text{ points}) \tag{12}$$

$$\left\|u^{T}x^{i} + v\right\| = \begin{cases} u^{T}x^{i} + v, & \text{if } u^{T}x^{i} + v \geq 0 \\ -\left(u^{T}x^{i} + v\right), & \text{if } u^{T}x^{i} + v \leq 0 \end{cases} \tag{13}$$

From the above, the following can be gathered:

$$\left\|u^{T}x^{i} + v\right\| = y_{i}\cdot\left[u^{T}x^{i} + v\right],\quad i = 1,2,\ldots,M \tag{14}$$

It should be noted that the product of the class labels with the distance from the hyperplane is always a non-negative quantity. Considering the above:

$$\min_{u,v,g,l}\; \frac{g}{l} \tag{15}$$

$$g \geq y_{i}\cdot\left[u^{T}x^{i} + v\right],\quad i = 1,2,\ldots,M \tag{16}$$

$$l \leq y_{i}\cdot\left[u^{T}x^{i} + v\right],\quad i = 1,2,\ldots,M \tag{17}$$

As would be understood, the above expression provided by Equation (15) is a linear fractional program. In one implementation, the function as described by Equation (15) may be transformed to a linear function by the data classification module 112. For example, the data classification module 112 may apply a Charnes-Cooper transformation, which introduces a variable p = 1/l and scales the remaining variables by p, to obtain a linear function. In one implementation, the following is obtained:

$$\min_{u,v,g,p,l}\; h = g\cdot p \tag{19}$$

$$g\cdot p \geq y_{i}\cdot\left[p\cdot u^{T}x^{i} + p\cdot v\right],\quad i = 1,2,\ldots,M \tag{20}$$

$$l\cdot p \leq y_{i}\cdot\left[p\cdot u^{T}x^{i} + p\cdot v\right],\quad i = 1,2,\ldots,M \tag{21}$$

$$p\cdot l = 1 \tag{22}$$

Denoting w ≡ p·u and b ≡ p·v, and noting that p·l = 1, we obtain the following:

$$\min_{w,b,h}\; h \tag{23}$$

$$h \geq y_{i}\cdot\left[w^{T}x^{i} + b\right],\quad i = 1,2,\ldots,M \tag{24}$$

$$1 \leq y_{i}\cdot\left[w^{T}x^{i} + b\right],\quad i = 1,2,\ldots,M \tag{25}$$

which in turn may be further represented as:

$$\min_{w,b,h}\; h \tag{27}$$

$$h \geq y_{i}\cdot\left[w^{T}x^{i} + b\right],\quad i = 1,2,\ldots,M \tag{28}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] \geq 1,\quad i = 1,2,\ldots,M \tag{29}$$

With the above expressions, the data classification module 112 may further determine w and b by solving the above relations. In one implementation, the data classification module 112 may further obtain the following function:

$$f(x) = w^{T}x + b \tag{30}$$
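Since Equations (27)-(29) constitute a linear program, any general-purpose LP solver may be used to obtain w, b, and h. The following is a minimal, non-authoritative sketch assuming scipy is available; the function name fit_linear_mcm and the toy data are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def fit_linear_mcm(X, y):
    """Solve the LP of Eqs. (27)-(29): minimise h subject to
    h >= y_i (w^T x^i + b) and y_i (w^T x^i + b) >= 1.
    X: (M, n) samples, y: (M,) labels in {+1, -1}. Assumes separable data."""
    M, n = X.shape
    c = np.zeros(n + 2)          # variables z = [w_1..w_n, b, h]
    c[-1] = 1.0                  # minimise h
    yx = y[:, None] * X          # y_i * x^i, row-wise
    # (28): y_i (w^T x^i + b) - h <= 0
    A1 = np.hstack([yx, y[:, None], -np.ones((M, 1))])
    # (29): -(y_i (w^T x^i + b)) <= -1
    A2 = np.hstack([-yx, -y[:, None], np.zeros((M, 1))])
    res = linprog(c, A_ub=np.vstack([A1, A2]),
                  b_ub=np.concatenate([np.zeros(M), -np.ones(M)]),
                  bounds=[(None, None)] * (n + 2), method="highs")
    w, b, h = res.x[:n], res.x[n], res.x[n + 1]
    return w, b, h

# Toy usage on a separable 2-D set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b, h = fit_linear_mcm(X, y)
print("w =", w, "b =", b, "h =", h)
print("predicted signs:", np.sign(X @ w + b))   # decision function of Eq. (30)
```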

In one example, a number of features may be selected based on Equations (27)-(29). The features for which the corresponding component of w is non-zero may be selected. Once the features are selected, a classifier may be obtained for points corresponding to such features, which may allow focussing on only the selected features for determining a classifier. As would be noted, determining a classifier based on the selected features would involve fewer processing resources and yield better classification results. In another implementation, the feature selection may also be used for compression of data by selecting only the relevant support vectors. For decompression, the reconstruction of the data may be based on such selected support vectors.
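A hedged sketch of this feature selection step (the tolerance, the hard-coded weight vector, and the names are illustrative assumptions; a solved w would normally come from a solver such as the preceding sketch):

```python
import numpy as np

# Select the features whose learned weights are (numerically) non-zero.
w = np.array([0.83, 0.0, -0.41, 1e-12, 0.27])   # hypothetical solved weights
tol = 1e-8                                      # treat tiny magnitudes as zero
selected = np.flatnonzero(np.abs(w) > tol)      # indices of informative features
print("selected feature indices:", selected)

X = np.random.rand(10, 5)                       # hypothetical data, 5 features
X_reduced = X[:, selected]                      # keep only the selected columns
print("reduced shape:", X_reduced.shape)
```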

In another implementation, the method may involve using a "soft max" function instead of the max function. In such a case, the distance of the points within the data set is measured as a weighted function of distances from a plurality of hyperplanes. Similarly, the min function may be replaced by a "soft min" function.

Accordingly, the class of a test sample x may be determined based on the sign of the function depicted by Equation (30). In one example, the values w and b are stored as the classifier 118.

It should also be noted that, in general, data sets will not be linearly separable. In one implementation, an error factor may be introduced to counter any misclassification error. In such a case, the formulation of Equations (27)-(29) may be represented by the following Equations (31)-(34):

$$\min_{w,b,h,q}\; h + C\cdot\sum_{i=1}^{M} q_{i} \tag{31}$$

$$h \geq y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{32}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{33}$$

$$q_{i} \geq 0,\quad i = 1,2,\ldots,M \tag{34}$$
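Equations (31)-(34) remain a linear program once C is fixed. A minimal sketch, again assuming scipy and using hypothetical names:

```python
import numpy as np
from scipy.optimize import linprog

def fit_soft_margin_mcm(X, y, C=1.0):
    """Soft-margin LP of Eqs. (31)-(34): minimise h + C*sum(q) subject to
    h >= y_i(w^T x^i + b) + q_i, y_i(w^T x^i + b) + q_i >= 1, q_i >= 0."""
    M, n = X.shape
    # Variables z = [w (n), b, h, q_1..q_M].
    c = np.concatenate([np.zeros(n + 1), [1.0], C * np.ones(M)])
    yx = y[:, None] * X
    I = np.eye(M)
    # y_i(w^T x^i + b) + q_i - h <= 0
    A1 = np.hstack([yx, y[:, None], -np.ones((M, 1)), I])
    # -(y_i(w^T x^i + b) + q_i) <= -1
    A2 = np.hstack([-yx, -y[:, None], np.zeros((M, 1)), -I])
    bounds = [(None, None)] * (n + 2) + [(0, None)] * M   # q_i >= 0
    res = linprog(c, A_ub=np.vstack([A1, A2]),
                  b_ub=np.concatenate([np.zeros(M), -np.ones(M)]),
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n]                            # w, b

# Example on a small, slightly overlapping dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1.0, (20, 2)), rng.normal(-1.5, 1.0, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w, b = fit_soft_margin_mcm(X, y, C=10.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```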

The above description has been provided from the perspective of linearly separable datasets within the training data 116. In the case of non-linearly separable datasets within the training data 116, the data classification module 112 may further determine a mapping function ϕ(x) for mapping input samples within the training data 116 to a space having a higher dimension (i.e., >n).

In such a case, for the higher dimensional space, a notional hyperplane (similar to the hyperplane described above, but qualified for the higher dimension and a function of ϕ(x)) may be defined by the data classification module 112:

$$u^{T}\phi(x) + v = 0 \tag{35}$$

wherein u denotes a column vector containing n elements, in which the elements are variables denoted by u₁, u₂, . . . , u_n for the vector u. The vector u is used to define a separating hyperplane.

ϕ(x) is a nonlinear transformation or mapping.

Similar to Equations (31)-(34), the following Equations (36)-(39) may be obtained as a function of the mapping function ϕ(x):

$$\min_{w,b,h,q}\; h + C\cdot\sum_{i=1}^{M} q_{i} \tag{36}$$

$$h \geq y_{i}\cdot\left[w^{T}\phi(x^{i}) + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{37}$$

$$y_{i}\cdot\left[w^{T}\phi(x^{i}) + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{38}$$

$$q_{i} \geq 0,\quad i = 1,2,\ldots,M \tag{39}$$

The image vectors ϕ(x^i), i=1, 2, . . . , M may be considered to form an overcomplete basis in the empirical feature space, in which w also lies. From the above, we can therefore also state:

$$w = \sum_{j=1}^{M}\lambda_{j}\,\phi(x^{j}) \tag{40}$$

Therefore,

$$w^{T}\phi(x^{i}) + b = \sum_{j=1}^{M}\lambda_{j}\,\phi(x^{j})^{T}\phi(x^{i}) + b = \sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b \tag{41}$$

where K(p,q) denotes the kernel function with input vectors p and q, and is defined as

$$K(p,q) = \phi(p)^{T}\phi(q) \tag{42}$$

Based on the above, the operation of the data classification module 112 may further continue to obtain the following Equations (43)-(46):

$$\min_{w,b,h,q}\; h + C\cdot\sum_{i=1}^{M} q_{i} \tag{43}$$

$$h \geq y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{44}$$

$$y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{45}$$

$$q_{i} \geq 0,\quad i = 1,2,\ldots,M \tag{46}$$
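Equations (43)-(46) can be handled in the same way after computing a kernel matrix. A minimal sketch, assuming a Gaussian (RBF) kernel and scipy; the helper names and the choice of kernel are assumptions, and the predict function evaluates the sign of the decision function given next as Equation (47):

```python
import numpy as np
from scipy.optimize import linprog

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def fit_kernel_mcm(X, y, C=1.0, gamma=1.0):
    """Kernel LP of Eqs. (43)-(46); variables are lambda_1..lambda_M, b, h, q.
    The decision value for sample i is sum_j lambda_j K(x^i, x^j) + b."""
    M = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    c = np.concatenate([np.zeros(M + 1), [1.0], C * np.ones(M)])  # [lam, b, h, q]
    yK = y[:, None] * K
    I = np.eye(M)
    # y_i (sum_j lambda_j K_ij + b) + q_i - h <= 0
    A1 = np.hstack([yK, y[:, None], -np.ones((M, 1)), I])
    # -(y_i (sum_j lambda_j K_ij + b) + q_i) <= -1
    A2 = np.hstack([-yK, -y[:, None], np.zeros((M, 1)), -I])
    bounds = [(None, None)] * (M + 2) + [(0, None)] * M
    res = linprog(c, A_ub=np.vstack([A1, A2]),
                  b_ub=np.concatenate([np.zeros(M), -np.ones(M)]),
                  bounds=bounds, method="highs")
    return res.x[:M], res.x[M]                                    # lambdas, b

def predict(X_train, lam, b, X_test, gamma=1.0):
    """Sign of sum_j lambda_j K(x, x^j) + b, i.e. the function of Eq. (47)."""
    return np.sign(rbf_kernel(X_test, X_train, gamma) @ lam + b)

# usage sketch: lam, b = fit_kernel_mcm(X, y, C=10.0, gamma=0.5)
#               labels = predict(X, lam, b, X_new, gamma=0.5)
```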

Once the variables λ_j, j=1, 2, . . . , M and b are obtained, the class of a test point x can be determined by the data classification module 112 based on the sign of the following function:

$$\sum_{j=1}^{M}\lambda_{j}\,K(x,x^{j}) + b \tag{47}$$

In another implementation, the data classification module 112 may further determine the classifier by extending the principles described above to other variants. By modifying the measures used for the error term in Equations (31) and (43), we can obtain other methods for building learning machines that may offer their own advantages. For example, a classifier for a least squares learning machine may be obtained by solving the following Equations (48)-(51):

$$\min_{w,b,h}\; h + C\cdot\sum_{i=1}^{M} q_{i}^{2} \tag{48}$$

$$h \geq y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{49}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{50}$$

with C as the error parameter.

It should be noted that each constraint in Equation (50) will always be met as an equality; if any is met as a strict inequality, then the constraint can be met as an equality while reducing the value of the objective function in Equation (48). Based on the above, the data classification module 112 may obtain the classifier through the following Equations (52)-(55):

$$\min_{w,b,h}\; h + C\cdot\sum_{i=1}^{M} q_{i}^{2} \tag{52}$$

$$h \geq y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{53}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i} = 1,\quad i = 1,2,\ldots,M \tag{54}$$

Note that the R.H.S. of (53) is identical to the L.H.S. of (54), and constraint (54) indicates that this is equal to 1 at any solution. Hence, we note that h = 1 at a solution. Therefore, the objective function can be simplified as follows:

$$\min_{w,b}\; C\cdot\sum_{i=1}^{M} q_{i}^{2} \tag{56}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i} = 1,\quad i = 1,2,\ldots,M \tag{57}$$

It is obvious to a person versed in the art that the multiplier C is redundant and can be removed, to yield the following:

$$\min_{w,b}\; \sum_{i=1}^{M} q_{i}^{2} \tag{59}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i} = 1,\quad i = 1,2,\ldots,M \tag{60}$$
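Because Equation (60) fixes q_i = 1 − y_i·(w^T x^i + b), Equations (59)-(60) amount to an ordinary least-squares problem. A small numpy sketch under that observation (the names are illustrative assumptions):

```python
import numpy as np

def fit_least_squares_variant(X, y):
    """Eqs. (59)-(60): with q_i = 1 - y_i (w^T x^i + b), minimising sum(q_i^2)
    means making y_i (w^T x^i + b) close to 1, i.e. w^T x^i + b close to y_i
    (since y_i is +1 or -1). That is an ordinary least-squares fit."""
    M, n = X.shape
    A = np.hstack([X, np.ones((M, 1))])           # augment with a bias column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # solve min ||A [w; b] - y||^2
    return coef[:n], coef[n]                      # w, b

# usage sketch: w, b = fit_least_squares_variant(X, y)
```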

In one implementation, the data classification module 112 may address the non-linear version of the above problem, which is the extension of the formulation using a kernel function, as given by:

$$\min_{w,b}\; \sum_{i=1}^{M} q_{i}^{2} \tag{62}$$

$$y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i} = 1,\quad i = 1,2,\ldots,M \tag{63}$$

In the above Equation, the error variables q_i, i=1, 2, . . . , M may be negative, zero, or positive. In one example, the data classification module 112 may measure the sum of squares of the error variables. In another implementation, the data classification module 112 may select other measures of the error variable, e.g., the L1 norm of the error vector. Accordingly, the following Equations are obtained:

$$\min_{w,b}\; \sum_{i=1}^{M}\left(q_{i}^{a} + q_{i}^{b}\right) \tag{64}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + \left(q_{i}^{a} - q_{i}^{b}\right) = 1,\quad i = 1,2,\ldots,M \tag{65}$$

$$q_{i}^{a},\, q_{i}^{b} \geq 0,\quad i = 1,2,\ldots,M \tag{66}$$

Once the above Equations are obtained, the data classification module 112 may further obtain their non-linear equivalent, which is represented by the following set of Equations:

$$\min_{w,b}\; \sum_{i=1}^{M}\left(q_{i}^{a} + q_{i}^{b}\right) \tag{67}$$

$$y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + \left(q_{i}^{a} - q_{i}^{b}\right) = 1,\quad i = 1,2,\ldots,M \tag{68}$$

$$q_{i}^{a},\, q_{i}^{b} \geq 0,\quad i = 1,2,\ldots,M \tag{69}$$

In one example, a parameter C may further be included to provide a trade-off between the VC dimension bound and the misclassification error. In yet another implementation, the data classification module 112 may consider C as a variable and optimize its value to determine the optimal trade-off, to provide the following equations:

$$\min_{w,b,h,C}\; h + C\cdot\sum_{i=1}^{M} q_{i} \tag{70}$$

$$h \geq y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{71}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{72}$$

$$q_{i} \geq 0,\quad i = 1,2,\ldots,M \tag{73}$$

The Equations (70)-(73) can be seen to be quadratic, since the objective contains the product of the variable C with the error variables q_i. As would be understood, solving such quadratic functions is computationally expensive. In one example, the data classification module 112 may instead select an appropriate fixed value for C in determining a solution to the above mentioned equations. Continuing with the above, the Equations (70)-(73) can also be represented as follows:

$$\min_{w,b,h,C}\; h + C\cdot\sum_{i=1}^{M} q_{i} \tag{74}$$

$$h \geq y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{75}$$

$$y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{76}$$

$$q_{i} \geq 0,\quad i = 1,2,\ldots,M \tag{77}$$

The above two quadratic formulations use a single variable C that multiplies all error variables q_i, i=1, 2, . . . , M. In one example, the data classification module 112 may use different variables c_i as weighting factors for the corresponding error variables q_i, i=1, 2, . . . , M, which can then be represented as:

$$\min_{w,b,h,c_{i},\, i=1,2,\ldots,M}\; h + \sum_{i=1}^{M} c_{i}\cdot q_{i} \tag{78}$$

$$h \geq y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{79}$$

$$y_{i}\cdot\left[w^{T}x^{i} + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{80}$$

$$q_{i} \geq 0,\quad i = 1,2,\ldots,M \tag{81}$$

For non-linearly separable datasets, the above equations may be represented as:

$$\min_{w,b,h,c_{i},\, i=1,2,\ldots,M}\; h + \sum_{i=1}^{M} c_{i}\cdot q_{i} \tag{82}$$

$$h \geq y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i},\quad i = 1,2,\ldots,M \tag{83}$$

$$y_{i}\cdot\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i} \geq 1,\quad i = 1,2,\ldots,M \tag{84}$$

$$q_{i} \geq 0,\quad i = 1,2,\ldots,M \tag{85}$$

It should be noted that the classifiers as described above are of low VC dimension and can be obtained for linearly separable and non-linearly separable datasets. Furthermore, the classifier is obtained by considering only essential, non-redundant characteristics of the training data. In such a case, the processing required is less and the process of obtaining the classifier is efficient. Furthermore, since the basis on which the classification is performed may also involve fewer features, the process of classification is faster and more efficient. Various experimental results are also shared in the following description, indicating the increased efficiency with which the classification is carried out.

The advantages of the present subject matter are provided with reference to the exemplary implementations illustrated below. It should also be understood that such implementations are not limiting. Other implementations based on similar approaches would also be within the scope of the present subject matter.

Exemplary Embodiments

Exemplary implementations in accordance with aspects of the invention relating to learning a linear classifier are now discussed with reference to some non-limiting examples. Comparisons are provided with results obtained using LIBSVM, which is a public domain implementation of Support Vector Machines (SVMs).

TABLE 1
Characteristics of the benchmark datasets used.

Dataset                  Size (samples × features × classes)
Blogger                  100 × 6 × 2
Fertility diagnosis      100 × 9 × 2
Promoters                106 × 57 × 2
Echocardiogram           132 × 12 × 2
Teaching assistant       151 × 5 × 3
Hepatitis                155 × 19 × 2
Hayes                    160 × 5 × 3
Plrx                     182 × 12 × 2
Seed                     210 × 7 × 3
Glass                    214 × 10 × 6
Heartstatlog             270 × 13 × 2
Horsecolic               300 × 27 × 2
Haberman                 306 × 3 × 2
E coli                   336 × 8 × 3
House voters             435 × 16 × 2
Wholesale customers      440 × 8 × 2
IPLD                     583 × 10 × 2
Balance                  625 × 4 × 3
Australian               690 × 14 × 2
Crx                      690 × 15 × 2
Transfusion              748 × 5 × 2
Tic tac toe              958 × 9 × 2
Sorlie                   85 × 456 × 2
Secom                    1567 × 591 × 2
Tian                     173 × 12626 × 2

TABLE 2

                        Present Subject Matter                 Conventional Art
Dataset                 Accuracy        Time (s)               Accuracy        Time (s)
Blogger                 69.00 ± 17.15   0.0012 ± 6.64e−5       58.00 ± 20.40   6.17 ± 2.51
Fertility diagnosis     88.00 ± 9.27    0.0013 ± 5.27e−5       86.00 ± 9.01    8.12 ± 1.47
Promoters               74.08 ± 10.88   0.0014 ± 4.53e−5       67.78 ± 10.97   0.85 ± 0.03
Echocardiogram          90.88 ± 5.75    0.0014 ± 4.62e−5       86.38 ± 4.50    0.72 ± 0.36
Teaching assistant      66.27 ± 6.77    0.0013 ± 3.46e−5       64.94 ± 6.56    16.07 ± 4.17
Hepatitis               68.38 ± 6.26    0.0014 ± 3.82e−5       60.64 ± 7.19    1.90 ± 0.56
Hayes                   76.32 ± 9.25    0.0012 ± 2.73e−5       73.56 ± 7.73    7.19 ± 3.81
Plrx                    71.83 ± 7.49    0.0015 ± 3.37e−5       71.42 ± 7.37    4.35 ± 0.78
Seed                    97.61 ± 1.51    0.0015 ± 3.97e−5       90.95 ± 4.09    12.37 ± 4.51
Glass                   99.06 ± 1.16    0.0042 ± 5.56e−3       98.12 ± 1.75    11.83 ± 3.44
Heartstatlog            84.81 ± 3.87    0.0018 ± 1.47e−5       82.59 ± 2.22    9.43 ± 4.25
Horsecolic              81.00 ± 4.03    0.0021 ± 7.17e−5       80.26 ± 4.63    41.39 ± 13.93
Haberman                73.89 ± 3.71    0.0019 ± 4.34e−5       72.56 ± 3.73    13.74 ± 6.63
E coli                  96.73 ± 1.96    0.0023 ± 1.3e−4        96.73 ± 1.96    18.41 ± 2.57
House voters            95.63 ± 1.84    0.0031 ± 1.87e−4       94.48 ± 2.46    15.77 ± 2.19
Wholesale customer      92.26 ± 1.97    0.0033 ± 1.07e−4       91.13 ± 1.95    32.11 ± 8.29
IPLD                    71.35 ± 2.93    0.0065 ± 4e−5          71.35 ± 2.93    12.30 ± 8.26
Balance                 95.26 ± 1.02    0.0077 ± 1.3e−3        95.20 ± 1.01    8.37 ± 1.03
Australian              85.73 ± 2.04    0.0076 ± 9.75e−5       84.49 ± 1.18    407.97 ± 167.73
Crx                     69.56 ± 2.79    0.0095 ± 1.36e−3       67.79 ± 3.47    498.04 ± 35.22
Transfusion             78.19 ± 3.25    0.0082 ± 8.21e−4       77.13 ± 2.26    173.06 ± 44.12
Tic tac toe             74.22 ± 5.50    0.038 ± 4.9e−2         73.91 ± 6.11    24.13 ± 6.81
Sorlie                  94.084 ± 1.54   0.165 ± 0.15           90.19 ± 2.47    187.50 ± 1.37
Secom                   87.87 ± 1.88    957.00 ± 87.29         86.04 ± 0.82    6359.78 ± 15.93
Tian                    81.71 ± 1.43    1.39 ± 0.67            80.92 ± 1.39    7832.76 ± 6.31

The data provided in Tables 1 and 2 has been obtained by working with an implementation of the present subject matter for a linear classifier, and published datasets which are amongst the benchmark data sets used in the art to compare different classification methods with each other. The number of data samples and the dimension of each dataset are indicated alongside the name of the dataset. Test set accuracies are indicated in the format (mean ± standard deviation); these are obtained by using a five-fold cross validation methodology.

In another exemplary implementation in accordance with aspects of the invention, relating to a non-linear classifier, comparisons are provided with results obtained using LIBSVM, which is a public domain implementation of Support Vector Machines (SVMs). Since SVMs are considered amongst the state-of-the-art methods in machine learning, the data provided is indicative of the advantage of the present subject matter.

Table 3 provides experimental data from implementations involving a non-linear classifier. A radial basis function has been used for the purposes of the present implementation. The data provided in Table 3 demonstrates the increased accuracy of the present subject matter with respect to conventional systems.

TABLE 3

                        Present Subject Matter                          Conventional Art
Dataset                 Accuracy        CPU time (s)    #SV             Accuracy        CPU time (s)     #SV
Blogger                 88.00 ± 4.00    0.32 ± 0.03     22.20 ± 5.91    81.00 ± 10.20   2573 ± 49.2      51.20 ± 3.06
Fertility diagnosis     89.00 ± 2.00    0.18 ± 0.09     9.80 ± 19.60    88.00 ± 9.27    8.03 ± 1.95      38.20 ± 1.60
Promoters               84.93 ± 1.56    0.45 ± 0.39     82.40 ± 2.73    75.59 ± 7.63    4.40 ± 1.33      83.80 ± 0.98
Echocardiogram          89.34 ± 4.57    0.31 ± 0.01     12.00 ± 0.00    87.14 ± 7.27    8.58 ± 1.91      48.00 ± 2.10
Teaching assistant      74.83 ± 2.60    0.39 ± 0.13     26.60 ± 32.43   68.88 ± 6.48    4192 ± 162       86.00 ± 3.22
Hepatitis               85.80 ± 8.31    0.44 ± 0.02     20.00 ± 0.00    82.57 ± 6.32    3561 ± 4392      72.20 ± 4.31
Hayes                   81.82 ± 7.28    0.31 ± 0.05     3.23 ± 1.11     79.57 ± 6.60    1427 ± 54.7      84.20 ± 2.04
Plrx                    71.99 ± 5.81    0.41 ± 0.10     4.40 ± 8.80     71.41 ± 6.04    144.21 ± 5816    116.2 ± 6.14
Seed                    97.13 ± 0.95    0.79 ± 0.01     11.20 ± 5.71    91.90 ± 2.86    3362 ± 85.1      51.80 ± 1.72
Glass                   96.23 ± 2.77    1.69 ± 0.50     36.00 ± 11.49   90.64 ± 5.09    20475 ± 832      64.80 ± 2.40
Heartstatlog            84.44 ± 3.21    1.32 ± 0.76     10 ± 2.23       83.7 ± 1.54     1547 ± 324.52    124.6 ± 4.15
Horsecolic              82.33 ± 4.03    3.84 ± 2.31     36.60 ± 17.70   81.33 ± 4.14    13267 ± 2646     187.2 ± 3.27
Haberman                73.49 ± 3.85    1.23 ± 0.32     8.50 ± 7.00     72.81 ± 3.51    2087 ± 750       138.2 ± 3.27
E coli                  97.32 ± 1.73    3.47 ± 0.30     24.00 ± 1.41    96.42 ± 2.92    11829 ± 248      57.00 ± 4.65
House voters            95.87 ± 1.16    4.24 ± 0.83     17.80 ± 8.91    95.42 ± 2.04    8827 ± 349       93.60 ± 3.93
Wholesale customer      92.72 ± 1.54    7.31 ± 0.93     39.00 ± 10.64   90.90 ± 1.90    9243 ± 362       123.40 ± 2.15
IPLD                    72.03 ± 3.20    4.06 ± 5.02     23.40 ± 30.50   70.15 ± 2.24    9743 ± 322       311.60 ± 5.31
Balance                 97.64 ± 1.32    8.78 ± 1.32     14.60 ± 0.49    97.60 ± 0.51    15442 ± 651      143.00 ± 4.23
Australian              85.65 ± 2.77    103.45 ± 18.04  108.80 ± 1.60   84.31 ± 3.01    94207 ± 4476     244.8 ± 4.64
Crx                     69.56 ± 2.90    5.95 ± 2.55     3.40 ± 6.80     69.27 ± 2.62    19327 ± 5841     404.4 ± 8.69
Transfusion             77.00 ± 2.84    7.08 ± 0.69     6.00 ± 3.52     76.73 ± 2.88    18254 ± 1531     302.20 ± 7.55
Tic tac toe             98.32 ± 0.89    12.55 ± 0.56    10.00 ± 0.00    93.94 ± 2.10    18674 ± 973      482.60 ± 3.93
Sorlie                  98.82 ± 2.35    0.44 ± 0.15     50 ± 4.77       97.644 ± 2.88   78.63 ± 9.81     68.95 ± 3.72
Secom                   94.11 ± 2.23    1521 ± 75.5     382.8 ± 44.23   92.29 ± 0.82    38769.25 ± 8.87  593.2 ± 17.22
Tian                    97.09 ± 3.83    2.05 ± 0.199    70.4 ± 3.26     95.188 ± 4.26   88.97 ± 3.26     75.6 ± 1.01

In yet another implementation, an alternate bound involving the number of support vectors may be provided as:

$$E\left(P_{error}\right) \leq \frac{E\left(\#\,\text{support vectors}\right)}{\#\,\text{training samples}} \tag{86}$$

In the above Equation, E(P_error) denotes the expected error on test samples taken from the general distribution, the denominator denotes the number of training samples, and E(# support vectors) denotes the expected number of support vectors obtained on training sets of the same size. Although the bound was shown for linearly separable datasets, it does indicate that the number of support vectors is also related to the prediction error. An examination of the table indicates that the proposed approach shows a lower test set error, and also uses a smaller number of support vectors.
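As a purely hypothetical illustration of this bound: if training sets of a given size are expected to yield 30 support vectors out of 300 training samples, Equation (86) gives E(P_error) ≤ 30/300 = 0.1, i.e., an expected test error of at most 10%.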

Exemplary implementations in accordance with aspects of the invention relating to determining salient or discriminative features are now discussed with reference to some non-limiting examples. The following data have been obtained based on published datasets, which are amongst the benchmark data sets used in the art to compare different classification methods with each other. The chosen datasets comprise high dimensional data. Test set accuracies are indicated in the format mean ± standard deviation; these are obtained by using a five-fold cross validation methodology. These experimental results demonstrate that the number of features relied on by the present subject matter is comparatively smaller when considered with respect to conventional systems.

TABLE 4

                                Features                         Test Set Accuracy
Dataset (samples × dimension)   MCM     ReliefF   FCBF           MCM            ReliefF        FCBF
West (49 × 7129)                32      2207      1802           79.7 ± 6.8%    65.3 ± 3.2%    59.19 ± 3.1%
Artificial (100 × 2500)         79      1155      1839           82.1 ± 6.1%    80.8 ± 2.9%    80.7 ± 2.2%
Cancer (62 × 2000)              48      509       1346           77.3 ± 6.2%    74.6 ± 0.9%    75.8 ± 4.1%
Khan (63 × 2308)                48      437       897            92.7 ± 1.5%    89.6 ± 5.5%    91.3 ± 0.5%
Gravier (168 × 2905)            132     1096      1573           83.3 ± 2.6%    84.5 ± 1.8%    82.5 ± 1.6%
Golub (72 × 7129)               47      2271      7129           95.8 ± 4.2%    90.3 ± 4.8%    95.8 ± 4.2%
Alon (62 × 2000)                41      896       1984           83.8 ± 3.3%    82.2 ± 7.4%    82.1 ± 7.8%
Christensen (198 × 1413)        98      633       1413           99.5 ± 0.7%    99.5 ± 0.7%    99.5 ± 0.7%
Shipp (77 × 7129)               51      3196      7129           96.1 ± 0.1%    93.5 ± 2.1%    93.5 ± 4.4%
Singh (102 × 12600)             81      5650      11619          91.2 ± 3.9%    89.2 ± 2.0%    92.5 ± 2.7%

In another aspect, the present invention may be configured as a minimal complexity regression system. For this, we may consider the case of a linear regressor y = u^T x + v. For the present implementation, the samples may be considered as fitted by a regressor with zero error, i.e., the regressor lies within an ε tube around all training samples. From the link between classification and regression, we may note that for a hyperplane u^T x + v = 0, the margin may be the distance of the closest point from the hyperplane, which in turn is provided by:

$$\min_{i=1,2,\ldots,M}\; \frac{\left\|u^{T}x^{i} + v\right\|}{\|u\|}$$

We therefore have

$$\left(\frac{R}{d}\right)^{2} = \left(\frac{\max_{i=1,2,\ldots,M}\left\|x^{i}\right\|}{\min_{i=1,2,\ldots,M}\dfrac{\left\|u^{T}x^{i} + v\right\|}{\|u\|}}\right)^{2}$$

which may be written as

$$\left(\frac{R}{d}\right)^{2} = \left(\frac{\|u\|\,\max_{i=1,2,\ldots,M}\left\|x^{i}\right\|}{\min_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|}\right)^{2}$$

For the regression analysis, we also consider a gap-tolerant formulation, with a margin d which is greater than or equal to d_min. With this, we obtain the following Equations:

$\begin{matrix}{{\frac{{{u^{T}x^{i}} + v}}{u} \geq d_{\min}},{i = 1},2,\ldots\mspace{14mu},M} & (5) \\{\left. \Longrightarrow{u} \right. \leq} & (6)\end{matrix}$which in turn provides the following:

$$\|u\|\,\max_{i=1,2,\ldots,M}\left\|x^{i}\right\| \leq \max_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\| \cdot \frac{\max_{i=1,2,\ldots,M}\left\|x^{i}\right\|}{d_{\min}} \tag{7}$$

This can also be represented by the following expression:

$$\|u\|\,\max_{i=1,2,\ldots,M}\left\|x^{i}\right\| \leq \beta\cdot\max_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\| \tag{8}$$

where β is a constant independent of u and v, and dependent only on the dataset and the choice of d_min.

This provides the following Equation:

$$\min_{u,v}\;\left(\frac{\max_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|}{\min_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|}\right)^{2} = \frac{\max_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|^{2}}{\min_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|^{2}} \tag{9}$$

Since the R.H.S. of the above Equation (the square function) is monotonically increasing, the above result may also be achieved by minimizing the function:

$$\frac{\max_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|}{\min_{i=1,2,\ldots,M}\left\|u^{T}x^{i} + v\right\|} \tag{10}$$

Without loss of generality, the following assumption may be considered as valid, which in turn provides us with the conclusion that all values are non-negative:

$$y_{i} \geq \epsilon,\quad i = 1,2,\ldots,M \tag{11}$$

Since the regressor lies within an ε tube around each of the samples, we have

$$u^{T}x^{i} + v \geq y_{i} - \epsilon,\quad i = 1,2,\ldots,M \tag{12}$$

$$u^{T}x^{i} + v \leq y_{i} + \epsilon,\quad i = 1,2,\ldots,M \tag{13}$$

Since all function values are non-negative, we have

$$\left\|u^{T}x^{i} + v\right\| = u^{T}x^{i} + v,\quad i = 1,2,\ldots,M \tag{14}$$

Summing up, the data classification module 112 may determine the appropriate regressor by solving the following optimization problem:

$$\min_{u,v,g,l}\; \frac{g}{l} \tag{15}$$

$$g \geq \left(u^{T}x^{i} + v\right),\quad i = 1,2,\ldots,M \tag{16}$$

$$l \leq \left(u^{T}x^{i} + v\right),\quad i = 1,2,\ldots,M \tag{17}$$

$$\left(u^{T}x^{i} + v\right) \geq \left(y_{i} - \epsilon\right),\quad i = 1,2,\ldots,M \tag{18}$$

$$\left(u^{T}x^{i} + v\right) \leq \left(y_{i} + \epsilon\right),\quad i = 1,2,\ldots,M \tag{19}$$

This is a linear fractional programming problem, as also discussed in conjunction with the classifier. We apply the Charnes-Cooper transformation. This consists of introducing a variable p = 1/l, which we substitute into (15)-(19) to obtain

$$\min_{u,v,g,p,l}\; h = g\cdot p \tag{20}$$

$$g\cdot p \geq p\cdot\left(u^{T}x^{i} + v\right),\quad i = 1,2,\ldots,M \tag{21}$$

$$l\cdot p \leq p\cdot\left(u^{T}x^{i} + v\right),\quad i = 1,2,\ldots,M \tag{22}$$

$$p\cdot\left(u^{T}x^{i} + v\right) \geq p\cdot\left(y_{i} - \epsilon\right),\quad i = 1,2,\ldots,M \tag{23}$$

$$p\cdot\left(u^{T}x^{i} + v\right) \leq p\cdot\left(y_{i} + \epsilon\right),\quad i = 1,2,\ldots,M \tag{24}$$

$$p\cdot l = 1 \tag{25}$$

$$p \geq 0 \tag{26}$$

Denoting w = p·u, b = p·v, and noting that p·l = 1, we obtain the following:

$$\min_{w,b,h,p}\; h \tag{27}$$

$$h \geq \left[w^{T}x^{i} + b\right],\quad i = 1,2,\ldots,M \tag{28}$$

$$1 \leq \left[w^{T}x^{i} + b\right],\quad i = 1,2,\ldots,M \tag{29}$$

$$w^{T}x^{i} + b - p\cdot\left(y_{i} - \epsilon\right) \geq 0 \tag{30}$$

$$w^{T}x^{i} + b - p\cdot\left(y_{i} + \epsilon\right) \leq 0 \tag{31}$$

$$p \geq 0 \tag{32}$$

which may be written as:

$$\min_{w,b,h,p}\; h \tag{33}$$

$$h \geq \left[w^{T}x^{i} + b\right],\quad i = 1,2,\ldots,M \tag{34}$$

$$\left[w^{T}x^{i} + b\right] \geq 1,\quad i = 1,2,\ldots,M \tag{35}$$

$$w^{T}x^{i} + b - p\cdot\left(y_{i} - \epsilon\right) \geq 0 \tag{36}$$

$$w^{T}x^{i} + b - p\cdot\left(y_{i} + \epsilon\right) \leq 0 \tag{37}$$

$$p \geq 0 \tag{38}$$

Based on the above Equations, the data classification module 112 determines the parameters w, b, and p to provide a regressor as follows:

$$y = u^{T}x + v \tag{39}$$

where

$$u = \frac{w}{p},\qquad v = \frac{b}{p}$$
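Equations (33)-(38) again form a linear program in w, b, h, and p. A minimal, non-authoritative sketch assuming scipy, with the recovery u = w/p and v = b/p as stated above (the names and the toy data are hypothetical):

```python
import numpy as np
from scipy.optimize import linprog

def fit_mcm_regressor(X, y, eps=0.1):
    """Linear regressor via the LP of Eqs. (33)-(38).
    Assumes the targets satisfy y_i >= eps (cf. Eq. (11)); shift them first
    if not. Returns u, v such that y lies within an eps tube of u^T x + v."""
    M, n = X.shape
    c = np.zeros(n + 3)                         # variables z = [w (n), b, h, p]
    c[n + 1] = 1.0                              # minimise h
    ones, zeros = np.ones((M, 1)), np.zeros((M, 1))
    # (34): (w^T x^i + b) - h <= 0
    A1 = np.hstack([X, ones, -ones, zeros]);            b1 = np.zeros(M)
    # (35): -(w^T x^i + b) <= -1
    A2 = np.hstack([-X, -ones, zeros, zeros]);          b2 = -np.ones(M)
    # (36): -(w^T x^i + b) + p (y_i - eps) <= 0
    A3 = np.hstack([-X, -ones, zeros, (y - eps)[:, None]]); b3 = np.zeros(M)
    # (37): (w^T x^i + b) - p (y_i + eps) <= 0
    A4 = np.hstack([X, ones, zeros, -(y + eps)[:, None]]);  b4 = np.zeros(M)
    bounds = [(None, None)] * (n + 2) + [(0, None)]     # p >= 0
    res = linprog(c, A_ub=np.vstack([A1, A2, A3, A4]),
                  b_ub=np.concatenate([b1, b2, b3, b4]),
                  bounds=bounds, method="highs")
    w, b, p = res.x[:n], res.x[n], res.x[-1]
    return w / p, b / p                                 # u = w/p, v = b/p

# Toy usage: noiseless line y = 2x + 3 (targets are positive, so Eq. (11) holds).
X = np.linspace(1.0, 5.0, 20).reshape(-1, 1)
y = 2.0 * X.ravel() + 3.0
u, v = fit_mcm_regressor(X, y, eps=0.1)
print("u =", u, "v =", v)
```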

In another implementation, we may consider a regression problem with data points x^i, i=1, 2, . . . , M, where the value of an unknown function at the point x^i is denoted by y_i ∈ ℝ. In the present implementation, it should be noted that the task of building a regressor on this data has a one-to-one correspondence with a binary classification task in which class (−1) points lie at the (n+1)-dimensional co-ordinates (x¹, y₁−ε), (x², y₂−ε), . . . , (x^M, y_M−ε), and class (+1) points lie at the co-ordinates (x¹, y₁+ε), (x², y₂+ε), . . . , (x^M, y_M+ε). In the present implementation, it is first assumed that this set of points is linearly separable, and we learn the classifier that separates the above training points. For the separating hyperplane w^T x + ηy + b = 0, the regressor is given by:

$$y = -\frac{1}{\eta}\left(w^{T}x + b\right)$$

-   wherein w is a column vector containing n elements, in which the elements are variables denoted by w₁, w₂, . . . , w_n for the vector w; the vector w is used to define a separating hyperplane;
-   T denotes the transposition operation, so that w^T is a row vector if w is a column vector containing n elements;
-   n is the dimension of the data, i.e., the number of features or attributes in the data; and
-   b is a scalar denoting the offset or bias associated with the hyperplane w^T x + b = 0.

From the above, the following Equations follow:

$$\min_{w,b,h}\; h$$

$$h \geq 1\cdot\left[\left(w^{T}x^{i} + b\right) + \eta\left(y_{i} + \epsilon\right)\right],\quad i = 1,2,\ldots,M$$

$$1\cdot\left[\left(w^{T}x^{i} + b\right) + \eta\left(y_{i} + \epsilon\right)\right] \geq 1,\quad i = 1,2,\ldots,M$$

$$h \geq -1\cdot\left[\left(w^{T}x^{i} + b\right) + \eta\left(y_{i} - \epsilon\right)\right],\quad i = 1,2,\ldots,M$$

$$-1\cdot\left[\left(w^{T}x^{i} + b\right) + \eta\left(y_{i} - \epsilon\right)\right] \geq 1,\quad i = 1,2,\ldots,M$$

As would be gathered from above, the first two constraints correspond to class (+1) samples; the multiplier (+1) corresponds to samples with label +1. Similarly, the last two constraints correspond to class (−1) samples; the multiplier (−1) corresponds to samples with label −1. After solving, we obtain w and b, and hence the regressor shown in the preceding paragraph.
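A short, illustrative sketch of the construction described above (the array names are assumptions): each training point is duplicated at (x^i, y_i + ε) with label +1 and at (x^i, y_i − ε) with label −1, after which any linear classifier of the stated form yields the regressor.

```python
import numpy as np

def regression_as_classification(X, y, eps=0.1):
    """Build the auxiliary binary classification set described above:
    class +1 points at (x^i, y_i + eps) and class -1 points at
    (x^i, y_i - eps) in (n+1)-dimensional space."""
    upper = np.hstack([X, (y + eps)[:, None]])   # class +1 points
    lower = np.hstack([X, (y - eps)[:, None]])   # class -1 points
    X_aug = np.vstack([upper, lower])
    labels = np.concatenate([np.ones(len(X)), -np.ones(len(X))])
    return X_aug, labels

# A classifier of the form w^T x + eta*y + b = 0 learned on (X_aug, labels),
# e.g. with the hypothetical fit_linear_mcm sketch above, then yields the
# regressor y = -(w^T x + b) / eta, where eta is the weight on the last column.
X = np.linspace(0.0, 1.0, 10).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0
X_aug, labels = regression_as_classification(X, y, eps=0.05)
print(X_aug.shape, labels.shape)    # (20, 2), (20,)
```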

In yet another implementation, when the trade-off parameter C is considered, the following Equations may be obtained:

$$\min_{w,b,h,p,q}\; h + C\cdot\sum_{i=1}^{M}\left(q_{i}^{+} + q_{i}^{-}\right) \tag{40}$$

$$h \geq \left[w^{T}x^{i} + b\right] + q_{i}^{+},\quad i = 1,2,\ldots,M \tag{41}$$

$$\left[w^{T}x^{i} + b\right] - q_{i}^{-} \geq 1,\quad i = 1,2,\ldots,M \tag{42}$$

$$\left(w^{T}x^{i} + b\right) - p\cdot\left(y_{i} - \epsilon\right) + q_{i}^{-} \geq 0 \tag{43}$$

$$\left(w^{T}x^{i} + b\right) - p\cdot\left(y_{i} + \epsilon\right) - q_{i}^{+} \leq 0 \tag{44}$$

$$p \geq 0;\quad q_{i}^{+},\, q_{i}^{-} \geq 0,\quad i = 1,2,\ldots,M \tag{45}$$

Solving the above equations provides the regressor function as follows:

$$y = u^{T}x + v \tag{46}$$

For non-linearly separable datasets, we consider a mapping function ϕ(x) which maps the dataset space to a higher dimension space. A corresponding notional hyperplane may be represented as follows:

$$y = u^{T}\phi(x) + v \tag{47}$$

Based on a similar methodology as adopted for linearly separable datasets, we obtain the following equations:

$$\min_{w,b,h,p,q}\; h + C\cdot\sum_{i=1}^{M}\left(q_{i}^{+} + q_{i}^{-}\right) \tag{48}$$

$$h \geq \left[w^{T}\phi(x^{i}) + b\right] + q_{i}^{+},\quad i = 1,2,\ldots,M \tag{49}$$

$$\left[w^{T}\phi(x^{i}) + b\right] - q_{i}^{-} \geq 1,\quad i = 1,2,\ldots,M \tag{50}$$

$$\left(w^{T}\phi(x^{i}) + b\right) - p\cdot\left(y_{i} - \epsilon\right) + q_{i}^{-} \geq 0 \tag{51}$$

$$\left(w^{T}\phi(x^{i}) + b\right) - p\cdot\left(y_{i} + \epsilon\right) - q_{i}^{+} \leq 0 \tag{52}$$

$$p \geq 0;\quad q_{i}^{+},\, q_{i}^{-} \geq 0,\quad i = 1,2,\ldots,M \tag{53}$$

As would be understood from above, the vectors of the mapping function, i.e., ϕ(x^i), i=1, 2, . . . , M, form an overcomplete basis in the empirical feature space, in which w also lies. Hence:

$$w = \sum_{j=1}^{M}\lambda_{j}\,\phi(x^{j}) \tag{54}$$

Therefore,

$$w^{T}\phi(x^{i}) + b = \sum_{j=1}^{M}\lambda_{j}\,\phi(x^{j})^{T}\phi(x^{i}) + b = \sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b \tag{55}$$

where K(p, q) denotes the kernel function with input vectors p and q, and is defined as

$$K(p,q) = \phi(p)^{T}\phi(q) \tag{56}$$

Substituting from (54) into (48)-(53), we obtain the following optimization problem:

$$\min_{w,b,h,p,q}\; h + C\cdot\sum_{i=1}^{M}\left(q_{i}^{+} + q_{i}^{-}\right) \tag{57}$$

$$h \geq \left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] + q_{i}^{+},\quad i = 1,2,\ldots,M \tag{58}$$

$$\left[\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right] - q_{i}^{-} \geq 1,\quad i = 1,2,\ldots,M \tag{59}$$

$$\left(\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right) - p\cdot\left(y_{i} - \epsilon\right) + q_{i}^{-} \geq 0 \tag{60}$$

$$\left(\sum_{j=1}^{M}\lambda_{j}\,K(x^{i},x^{j}) + b\right) - p\cdot\left(y_{i} + \epsilon\right) - q_{i}^{+} \leq 0 \tag{61}$$

$$p \geq 0;\quad q_{i}^{+},\, q_{i}^{-} \geq 0,\quad i = 1,2,\ldots,M \tag{62}$$

The variables λ_j, j=1, 2, . . . , M, b, and p are determined by the data classification module 112 by solving (57)-(62). On solving, the data classification module 112 obtains the following regressor:

$$y = u^{T}x + v = \sum_{j=1}^{M}\alpha_{j}\,K(x,x^{j}) + v,\quad \text{where } \alpha_{j} = \frac{\lambda_{j}}{p} \text{ and } v = \frac{b}{p} \tag{63}$$
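As a small, hedged sketch of how the regressor of Equation (63) may be evaluated once λ_j, b, and p have been obtained from a solver (the Gaussian kernel and the helper names are assumptions):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_regressor(X_train, lam, b, p, X_test, gamma=1.0):
    """Evaluate Eq. (63): y = sum_j (lambda_j / p) K(x, x^j) + b / p.
    lam, b, and p are assumed to come from solving Eqs. (57)-(62)."""
    alpha = lam / p
    v = b / p
    return rbf_kernel(X_test, X_train, gamma) @ alpha + v
```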

The following Table 4 provides results from an experimental implementation indicating comparative performances of systems implementing the present subject matter and conventional systems:

TABLE 4
Linear MCM regression results

                               Mean Squared Error
Dataset (Dimensions)           Present Subject Matter              Conventional Systems
Autompg (398 × 8)              0.35 ± 0.02                         0.36 ± 0.03
Yacht (308 × 7)                104.8 ± 7.5                         161.8 ± 68.4
Price (159 × 16)               33.6 ± 12.5 (in million dollars)    32.8 ± 23.2 (in million dollars)
Machine (209 × 7)              6.5368 ± 3.6512 (thousand units)    19.948 ± 15.521 (thousand units)
Baseball (337 × 17)            0.80 ± 0.12 (in million dollars)    1.62 ± 0.61 (in million dollars)
Housing (506 × 13)             23.09 ± 4.26                        25.92 ± 9.61
Energy Efficiency (768 × 8)    8.74 ± 1.35                         9.08 ± 1.45

TABLE 5
Kernel MCM regression results

                      Mean Squared Error                                 Number of Support Vectors
Dataset               Present Subject Matter   Conventional Systems     Present Subject Matter   Conventional Systems
Autompg               0.31 ± 0.02              0.32 ± 0.04              26.8 ± 7.9               184.2 ± 4.3
Yacht                 0.97 ± 0.42              158.86 ± 62.9            129.8                    224.8 ± 0.8
Price (mill. $)       12.77 ± 9.0              39.48 ± 26.9             68.6                     126.4 ± 0.9
Machine (th. units)   7.588 ± 3.909            26.351 ± 21.330          52.4 ± 27.3              166.4 ± 1.5
Baseball (mill. $)    0.78 ± 0.14              1.78 ± 0.67              24.4 ± 6.8               269.2 ± 1.1
Housing               25.8 ± 4.64              29.72 ± 5.96             76.4                     386.8
Energy Efficiency     4.1 ± 0.2                7.64 ± 1.31              44 ± 3.39                557 ± 5.05

Table 5 summarizes five-fold cross validation results of the kernel MCM regressor on a number of datasets. The width of the Gaussian kernel was chosen by using a grid search. The table shows the mean squared error and the number of support vectors for both the kernel MCM and the classical SVM with a Gaussian kernel. The results indicate that the kernel MCM yields better generalization than the SVM. In the case of kernel regression, the MCM also uses fewer support vectors; for some of the datasets, the MCM uses less than one-tenth the number of support vectors required by an SVM. The large difference from the SVM results indicates that, despite good performance, SVMs may still be far from the optimal solution.

As would be understood from above, the system implementing the present subject matter utilizes fewer support vectors and kernel evaluations, thereby reducing the overall computing resources which would be required for data classification, and also reducing the mean error. This results in an increase in the accuracy of the system for data classification. As would be understood, the present subject matter thus provides more efficient systems and methods for data classification when considered with respect to the conventional systems known in the art.

FIG. 2 illustrates a method 200 for classifying data based on a maximum margin classifier having a low VC dimension. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks may be combined in any order to implement the aforementioned methods, or an alternative method. Furthermore, the method 200 may be implemented by a processing resource or computing device(s) through any suitable hardware, non-transitory machine-readable instructions, or a combination thereof.

It may also be understood that method 200 may be performed by programmed computing devices, such as the data classification system 100 as depicted in FIGS. 1 and 2. Furthermore, the method 200 may be executed based on instructions stored in a non-transitory computer readable medium, as will be readily understood. The non-transitory computer readable medium may include, for example, digital memories, magnetic storage media, such as one or more magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media. Although the method 200 is described below with reference to the data classification system 100 as described above, other suitable systems for the execution of these methods can be utilized. Additionally, implementation of these methods is not limited to such examples.

At block 202, training data having a predefined sample size is obtained. In one implementation, the training data is composed of separable datasets. The training data may either be linearly or non-linearly separable. In another implementation, the training data may be obtained by the data classification module 112 from the training data 116.

At block 204, a Vapnik-Chervonenkis (VC) dimension is determined for the training data. For example, the VC dimension may be determined by the data classification module 112. As would be understood, the VC dimension generalizes one or more conditions based on the training data and may be considered indicative of the capacity of a classification approach, i.e., of the complexity of the system under consideration.

At block 206, an exact bound on the VC dimension is determined. For example, the data classification module 112 may determine the exact bound on the VC dimension. In one implementation, the exact bound for a linearly separable dataset is provided by the following relation:

$h = \frac{{Max}_{{i = 1},2,\ldots\mspace{14mu},M}{{{u^{T}x^{i}} + v}}}{{Min}_{{i = 1},2,\ldots\mspace{14mu},M}{{{u^{T}x^{i}} + v}}}$

wherein x^(i), i=1, 2, . . . , M depict data points within the training data.
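For illustration only, the bound h defined above may be evaluated for a candidate hyperplane as in the following sketch; the function name and the NumPy representation are hypothetical conveniences, not part of the present subject matter:

```python
import numpy as np

def exact_bound_h(u, v, X):
    """Evaluate h = max_i (u^T x^i + v) / min_i (u^T x^i + v) of block 206
    for a candidate hyperplane u^T x + v = 0, where the rows of X are the
    data points x^i.  Assumes the hyperplane classifies the points with zero
    error, so that all quantities u^T x^i + v share the same sign."""
    margins = X @ u + v
    return margins.max() / margins.min()
```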

At block 208, the exact bound is minimized to obtain the classifier. For example, the data classification module 112 may minimize the exact bound on the VC dimension to obtain the classifier. In the present example, the data classification module 112 may minimize the following function:

$\begin{matrix}{\underset{u,v}{Minimize}\;\frac{{Max}_{i = 1,2,\ldots,M}\left( u^{T}x^{i} + v \right)}{{Min}_{i = 1,2,\ldots,M}\left( u^{T}x^{i} + v \right)}} & (10)\end{matrix}$

for a notional hyperplane which classifies the plurality of points within the training data with zero error, represented as u^(T)x + v = 0.
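A minimal sketch of blocks 206-210 for linearly separable data is given below. It assumes, consistent with the reduction recited in claim 4, that the fractional objective (10) is reduced to a linear programme by scaling the hyperplane so that the denominator is at least one; the particular constraint form, the solver, the slack penalty C, and all names are illustrative assumptions rather than a definitive statement of the claimed method:

```python
import numpy as np
from scipy.optimize import linprog

def linear_mcm_classifier(X, y, C=1.0):
    """Sketch of blocks 206-210 for linearly separable data with labels
    y_i in {-1, +1}.  Decision variables are [u_1..u_d, v, h, q_1..q_M].
    The fractional objective (10) is assumed reduced to a linear programme
    by scaling so that min_i y_i (u^T x^i + v) >= 1."""
    M, d = X.shape
    n = d + 2 + M
    c = np.zeros(n)
    c[d + 1] = 1.0                      # minimise h ...
    c[d + 2:] = C                       # ... plus a misclassification penalty

    A_ub, b_ub = [], []
    for i in range(M):
        g = np.zeros(n)
        g[:d] = y[i] * X[i]
        g[d] = y[i]                     # g encodes y_i (u^T x^i + v)
        r = g.copy(); r[d + 1] = -1.0; r[d + 2 + i] = 1.0  # y_i(u^T x^i + v) + q_i <= h
        A_ub.append(r); b_ub.append(0.0)
        r = -g; r[d + 2 + i] = -1.0                        # y_i(u^T x^i + v) + q_i >= 1
        A_ub.append(r); b_ub.append(-1.0)

    bounds = [(None, None)] * (d + 2) + [(0, None)] * M    # q_i >= 0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    u, v = res.x[:d], res.x[d]
    return lambda Xq: np.sign(Xq @ u + v)   # decision function of block 210
```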

At block 210, the classifier is generated based on the minimized exact bound for predicting at least one class to which samples of the training data belong. In one implementation, the data classification module 112 generates the classifier for classification of data.

Although examples for the present disclosure have been described in language specific to structural features and/or methods, it should be understood that the appended claims are not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed and explained as examples of the present disclosure.

I claim:
1. A method for classifying binary data, the method comprising: obtaining training data having a predefined sample size, wherein the training data is composed of separable binary datasets; determining an exact bound on Vapnik-Chervonenkis (VC) dimension of a classifier for the training data, wherein the exact bound is based on one or more variables defining the hyperplane; minimizing the exact bound on the VC dimension; based on the minimizing of the exact bound, determining optimal values of the one or more variables defining the hyperplane; and generating a classifier, based on the minimized exact bound, for predicting one class to which a given data sample of the training data belongs, wherein the exact bound is a function of the distances of the closest and furthest points, from amongst the training data, from the hyperplane, wherein the hyperplane classifies a plurality of points within the training data with zero error, and wherein, for a notional hyperplane depicted by the following relation: u^(T)x+v=0, the exact bound on the VC dimension for the hyperplane is a function of h, being defined by: $h = \frac{{Max}_{{i = 1},2,\ldots\mspace{14mu},M}{{{u^{T}x^{i}} + v}}}{{Min}_{{i = 1},2,\ldots\mspace{14mu},M}{{{u^{T}x^{i}} + v}}}$ wherein x^(i), i=1, 2, . . . , M depict data points within the training data.
2. The method as claimed in claim 1, wherein the binary datasets are one of linearly separable datasets and non-linearly separable datasets.
3. The method as claimed in claim 1, wherein the function to be minimized is another function of h added to a misclassification error parameter.
4. The method as claimed in claim 1, wherein the minimizing of the exact bound further comprises: reducing the linear fractional programming problem of minimizing h to obtain a linear programming problem; and, by solving the linear programming problem so obtained, obtaining a decision function for classifying the test data.
5. The method as claimed in claim 4, wherein the decision function has a low VC dimension.
6. The method as claimed in claim 4, wherein the objective of the linear programming problem includes a function of the misclassification error.
7. A system for classifying test data, the system comprising: a processor; and a data classification module, wherein the data classification module is to: obtain training data having a predefined sample size, wherein the training data is composed of separable binary datasets; determine an exact bound on the Vapnik-Chervonenkis (VC) dimension of a hyperplane for the training data, wherein the exact bound depends on one or more variables defining the hyperplane; minimize the exact bound on the VC dimension; based on the minimizing of the exact bound, determine optimal values of the one or more variables defining the hyperplane; and generate a classifier based on the minimized exact bound for predicting one class to which a given data sample of the training data belongs, wherein the exact bound is a function of the distances of the closest and furthest points, from amongst the training data, from the hyperplane, wherein the hyperplane classifies a plurality of points within the training data with zero error; and wherein, for a notional hyperplane depicted by the following relation: u^(T)x+v=0, the exact bound on the VC dimension for the hyperplane is a function of h, being defined by: $h = \frac{{Max}_{{i = 1},2,\ldots\mspace{14mu},M}{{{u^{T}x^{i}} + v}}}{{Min}_{{i = 1},2,\ldots\mspace{14mu},M}{{{u^{T}x^{i}} + v}}}$ wherein x^(i), i=1, 2, . . . , M depict data points within the training data.
8. The system as claimed in claim 7, wherein the data classification module, for nonlinearly separable datasets in a first dimension, is to map samples of the training data from the first dimension to a higher dimension using a mapping function Φ.
9. The system as claimed in claim 7, wherein, for the notional hyperplane depicted by the relation u^(T) Φ(x)+v=0, the data classification module is to: minimize an exact bound on the VC dimension of the hyperplane, wherein the said classifier separates samples that have been transformed from the input dimension to a higher dimension by means of the mapping function (Φ); and wherein the minimization task is achieved by solving a fractional programming problem that has been reduced to a linear programming problem.
10. The system as claimed in claim 7, wherein the data classification module utilizes a Kernel function K, wherein K is a function of two input vectors ‘a’ and ‘b’, with K being positive definite; and K(a,b)=Φ(a)^(T) Φ(b), with K(a,b) being an inner product of the vectors obtained by transforming vectors ‘a’ and ‘b’ into a higher dimensional space by using the mapping function Φ.
11. The system as claimed in claim 7, wherein alternatively the data classification module is to further: obtain a tolerance regression parameter for a plurality of points within the training data; obtain the value of a hypothetical function or measurement at each of said training samples; derive a classification problem in which the samples of each of the two classes are determined by using the given data and the tolerance parameter; define a notional hyperplane, wherein the notional hyperplane classifies the plurality of points within the derived classification problem with minimal error; and, based on the notional hyperplane, generate a regressor corresponding to the plurality of points.
12. The system as claimed in claim 11, wherein, for the notional hyperplane defined by w^(T)x+ηy+b=0, the data classification module generates the regressor defined by $y = {{- \frac{1}{\eta}}\left( {{w^{T}x} + b} \right)}$.
13. The system as claimed in claim 12, wherein, for the points forming a linearly separable dataset, the regressor is a linear regressor.
14. The system as claimed in claim 12, wherein, for the points forming a nonlinearly separable dataset, the regressor is a kernel regressor.
15. The system as claimed in claim 12, wherein the regressor further includes an error parameter.
16. The method as claimed in claim 6, in which the solution of the linear programming problem yields a set of weights or co-efficients, with each weight corresponding to an input feature, attribute, or co-ordinate, and wherein the set of input features with non-zero weights constitutes a set of selected features to allow feature selection.
17. The method as claimed in claim 16, in which only the selected features are used to next compute a classifier, thus eliminating the noise or confusion introduced by features that are less discriminative.
18. The method as claimed in claim 3, in which the constraints are modified so that one of the terms of the objective function is non-essential and can be removed.
19. The method as claimed in claim 18, in which the removal of a term in the objective function removes the need to choose a hyper-parameter weighting the misclassification error, thus simplifying the use of the said method.
20. The method as claimed in claim 15, in which the constraints are modified so that one of the terms of the objective function is non-essential and can be removed.
21. The method as claimed in claim 1, wherein the Max function is replaced by a “soft Max” function in which distance is measured as a weighted function of distances from a plurality of hyperplanes, and in which the Min function is replaced by a “soft Min” function.
22. The system as claimed in claim 12, in which the Max function is replaced by a “soft Max” function in which distance is measured as a weighted function of distances from a plurality of hyperplanes, and in which the Min function is replaced by a “soft Min” function.